-
Notifications
You must be signed in to change notification settings - Fork 919
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix dask_cudf.read_parquet
regression for legacy timestamp data
#15929
Fix dask_cudf.read_parquet
regression for legacy timestamp data
#15929
Conversation
Note: Tried to add this to the proper "cuDF/Dask/..." project and get an error: |
So IIRC, if we do a deeper fix and allow |
Thanks for the review @mroeschke !
Yes. I would prefer a fix in cudf if you think that is reasonable :) |
Co-authored-by: Matthew Roeschke <[email protected]>
… timezone-read-parquet
I opened up #15935 to hopefully supersede this PR |
We had not archived any completed items recently. I just cleared out a backlog of around 600 "DONE" items, which should have helped. |
closes #13611 (This technically does not support pandas objects have interval types that are timezone aware) @rjzamora let me know if the test I adapted from your PR in #15929 is adequate Authors: - Matthew Roeschke (https://github.com/mroeschke) - GALI PREM SAGAR (https://github.com/galipremsagar) Approvers: - Lawrence Mitchell (https://github.com/wence-) URL: #15935
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think (to the best of my knowledge of timezones), this makes sense.
/merge |
Description
cudf does not currently support timezone-aware datetime columns. For example:
However,
cudf.read_parquet
does allow you to read this same data from a Parquet file. This PR adds a simple fix to allow the same data to be read withdask_cudf
. The dask_cudf version was previously "broken" because it relies on upstream pyarrow logic to constructmeta
as a pandas DataFrame (and then we just convertmeta
from pandas to cudf). As illustrated in the example above, this direct conversion is not allowed when one or more columns contain timezone information.Important Context
The actual motivation for this PR is to fix a regression in 24.06+ for older parquet files containing "legacy" timestamp types (e.g.
TIMESTAMP_MILLIS
andTIMESTAMP_MICROS
). Inpyarrow 14.0.2
(used by cudf-24.04), these legacy types were not automatically translated to timezone-aware dtypes by pyarrow. Inpyarrow 16.1.0
(used by cudf-24.06+), the legacy types ARE automatically translated. Therefore, in moving from cudf-24.04 to cudf-24.06+, somedask_cudf
users will find that they can no longer read the same parquet file containing legacy timestamp data.I'm not entirely sure if cudf should always allow users to read Parquet data with timezone-aware dtypes (e.g. if the timezone is not utc), but it definitely makes sense for cudf to ignore automatic/unnecessary timezone translations.
Checklist